Creators/Authors contains: "Srinath, Mukund"


  1. Companies' privacy policies and their contents are being analyzed for many reasons, including to assess the readability, usability, and utility of privacy policies; to extract and analyze data practices of apps and websites; to assess compliance of companies with relevant laws and their own privacy policies; and to develop tools and machine learning models to summarize and read policies. Despite the importance of and interest in studying privacy policies among researchers, regulators, and privacy activists, few best practices or approaches have emerged, and infrastructure and tool support is scarce or scattered. To provide insight into how researchers study privacy policies and the challenges they face when doing so, we conducted 26 interviews with researchers from various disciplines who have conducted research on privacy policies. We provide insights on a range of challenges around policy selection, policy retrieval, and policy content analysis, as well as multiple overarching challenges researchers experienced across the research process. Based on our findings, we discuss opportunities to better facilitate privacy policy research, including research directions for methodologically advancing privacy policy analysis, potential structural changes around privacy policies, and avenues for fostering an interdisciplinary research community and maturing the field.
    Free, publicly-accessible full text available October 1, 2024
  2. Terms of service documents are a common feature of organizations' websites. Although there is no blanket requirement for organizations to provide these documents, their provision often serves essential legal purposes. Users of a website are expected to agree with the contents of a terms of service document, but users tend to ignore these documents as they are often lengthy and difficult to comprehend. As a step towards understanding the landscape of these documents at a large scale, we present a first-of-its-kind terms of service corpus containing 247,212 English language terms of service documents obtained from company websites sampled from the Free Company Dataset. We examine the URLs and contents of the documents and find that some websites that purport to post terms of service do not actually provide them. We analyze reasons for unavailability and determine the overall availability of terms of service in a given set of website domains. We also identify that some websites provide an agreement that combines terms of service with a privacy policy, which is often an obligatory separate document. Using topic modeling, we analyze the themes in these combined documents by comparing them with themes found in separate terms of service and privacy policies. Results suggest that such single-page agreements miss some of the most prevalent topics available in typical privacy policies and terms of service documents and that many disproportionately cover privacy policy topics as compared to terms of service topics.
  3. Organisations disclose their privacy practices by posting privacy policies on their websites. Even though internet users often care about their digital privacy, they usually do not read privacy policies, since understanding them requires a significant investment of time and effort. Natural language processing has been used to create experimental tools to interpret privacy policies, but there has been a lack of large privacy policy corpora to facilitate the creation of large-scale semi-supervised and unsupervised models to interpret and simplify privacy policies. Thus, we present the PrivaSeer Corpus of 1,005,380 English language website privacy policies collected from the web. The number of unique websites represented in PrivaSeer is about ten times larger than the next largest public collection of web privacy policies, and it surpasses the aggregate of unique websites represented in all other publicly available privacy policy corpora combined. We describe a corpus creation pipeline with stages that include a web crawler, language detection, document classification, duplicate and near-duplicate removal, and content extraction. We employ an unsupervised topic modelling approach to investigate the contents of policy documents in the corpus and discuss the distribution of topics in privacy policies at web scale. We further investigate the relationship between privacy policy domain PageRanks and text features of the privacy policies. Finally, we use the corpus to pretrain PrivBERT, a transformer-based privacy policy language model, and obtain state-of-the-art results on the data practice classification and question answering tasks.
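
The third abstract lists duplicate and near-duplicate removal as one stage of its corpus creation pipeline but does not say how that stage works. As a rough illustration only, the sketch below shows one common way such a stage could be built: compare documents by the Jaccard similarity of their word shingles and keep only the first document of each near-identical group. The function names, the shingle size `k`, and the `threshold` value are all hypothetical choices made for this example and are not taken from the PrivaSeer work.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased, punctuation ignored)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Short documents are represented by a single shingle of all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(docs, threshold=0.8, k=3):
    """Keep each document unless it is too similar to one already kept.

    threshold and k are illustrative defaults, not values from any paper.
    This pairwise comparison is quadratic in the number of documents.
    """
    kept = []
    kept_shingles = []
    for doc in docs:
        s = shingles(doc, k)
        if any(jaccard(s, t) >= threshold for t in kept_shingles):
            continue  # near-duplicate of an earlier document
        kept.append(doc)
        kept_shingles.append(s)
    return kept


if __name__ == "__main__":
    policies = [
        "We collect your email address and share it with partners.",
        "We collect your email address and share it with partners!",
        "Cookies are used to remember your language preferences.",
    ]
    for policy in drop_near_duplicates(policies):
        print(policy)
```

Because every document is compared against every kept document, this sketch only suits small collections; at the scale of a million policies, an approximate method such as MinHash with locality-sensitive hashing would be the usual substitute.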